ABSTRACT

Recently, the ability to interact with messenger RNA 
(mRNA) has been reported for a number of known 
RNA-binding proteins, but surprisingly also for dif- 
ferent proteins without recognizable RNA binding 
domains including several transcription factors 
and metabolic enzymes. Moreover, direct binding 
to cognate mRNAs has been detected for multiple 
proteins, thus creating a strong impetus to search 
for functional significance and basic physico- 
chemical principles behind such interactions. Here, 
we derive interaction preferences between amino 
acids and RNA bases by analyzing binding inter- 
faces in the known 3D structures of protein-RNA 
complexes. By applying this tool to the human
proteome, we reveal statistically significant 
matching between the composition of mRNA se- 
quences and base-binding preferences of protein 
sequences they code for. For example, purine 
density profiles of mRNA sequences mirror 
guanine affinity profiles of cognate protein se- 
quences with quantitative accuracy (median 
Pearson correlation coefficient R = -0.80 across
the entire human proteome). Notably, statistically 
significant anti-matching is seen only in the case 
of adenine. Our results provide strong evidence for 
the stereo-chemical foundation of the genetic code 
and suggest that mRNAs and cognate proteins may 
in general be directly complementary to each other 
and associate, especially if unstructured. 

INTRODUCTION 

In the 50 years since the discovery of messenger RNA 
(mRNA) (1), the relationship between this key biopolymer 
and proteins has been studied predominantly in the 
context of transmission of genetic information and
protein synthesis. Recently, however, evidence of direct
non-covalent binding between mRNAs and a number of 
functionally diverse proteins has been provided, including 
surprisingly various metabolic enzymes, transcription 
factors and scaffolding proteins with hitherto 
uncharacterized RNA-binding domains (2-5). It has 
been found that such mRNA-protein complexes fre- 
quently participate in the formation of RNA droplets in 
the cell (e.g. P-bodies), which display all features of a 
separate cytoplasmic microphase and open up new para- 
digms in cell biophysics (6-8). What is more, several 
proteins have been found over the years to directly bind 
their own cognate mRNAs, including among others 
thymidylate synthase, dihydrofolate reductase and p53 
(2,9-14), with binding sites in both translated and untrans- 
lated mRNA regions. The functional significance of such 
cognate interactions has been clearly ascertained in some 
cases [e.g. translational feedback control (12)], but it is far 
from clear how general and functionally relevant they 
actually are. Kyrpides and Ouzounis hypothesized that 
cognate protein-mRNA interactions may represent an 
ancient mechanism for autoregulation of mRNA stability 
(9,10), but structural and mechanistic aspects of their 
proposal have never been explored in detail. Altogether, 
the rapid growth of the number of experimentally verified 
mRNA-binding proteins, both cognate and non-cognate, 
has now created a strong incentive to search for the func- 
tional significance of such interactions and, even more 
fundamentally, the basic physico-chemical rules that 
guide them. 

Related to this, we have recently shown that pyrimidine 
(PYR) density profiles of mRNA sequences tend to closely 
mirror sequence profiles of the respective cognate proteins 
capturing their amino-acid affinity for pyridines, chem- 
icals closely related to PYR (15). These findings 
provided strong support for the stereo-chemical hypoth- 
esis concerning the origin of the genetic code, the idea that 
the specific pairing between individual amino acids and 
cognate codons stems from direct binding preferences of 
the two for each other (16-21). However, based on our 
results, such binding complementarity may exist predominantly
at the level of longer polypeptide and mRNA
stretches rather than individual amino acids and codons. 
Stimulated by these findings, we hypothesized that PYR- 
rich regions in mRNAs and protein stretches encoded by 
them may bind each other in a complementary fashion, a 
feature encoded directly in the universal genetic code (15). 
Although strongly suggestive, these findings remained 
silent about the potentially equivalent complementarity 
on the side of purines (PUR) as well as any details concerning
specific nitrogenous bases. In addition to confirming
our previous results using a completely orthogonal
approach, the present study provides strong novel
evidence along both of these key lines.

MATERIALS AND METHODS 

Analysis of contacts between amino-acid side chains and 
RNA nucleobases 

All available structures of protein-RNA complexes (both 
X-ray and nuclear magnetic resonance structures) were 
downloaded from the Protein Data Bank (PDB) (22) in 
September 2012 using the 30% protein sequence identity
and 3 Å resolution (for X-ray structures) cutoffs. The
initial set was further manually filtered to exclude 
complexes containing double-stranded RNAs or mature 
transfer RNAs. The structures of the complete 
Saccharomyces cerevisiae (23), Escherichia coli (24) and 
Thermus thermophilus (25) ribosomes with the highest 
crystallographic resolution as well as the 50S subunit of 
the Deinococcus radiodurans (26) and Haloarcula 
marismortui ribosome were also included in the set. This 
resulted in a total of 299 individual PDB structures
(Supplementary Table S1). An amino-acid residue and
an RNA base were considered to be neighbors and to form
a contact if their centers of geometry were separated by
less than a given cutoff distance. All the results reported in
the main manuscript are given for the cutoff of 8 Å,
whereas for testing purposes, this cutoff was also varied
between 6 and 10 Å with a 0.25 Å step. We separately
analyzed contact statistics for residues having at least
one neighboring base (set '1+' with a total of 25 820
unique contacts for the 8 Å cutoff), at least two neighboring
bases (set '2+' with a total of 16 331 unique contacts for
the 8 Å cutoff) or for only the two closest neighboring
bases (set '2' with a total of 12 040 unique contacts for
the 8 Å cutoff).
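
The contact definition above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the input layout (lists of labelled atom coordinates) and the treatment of set '2' as the two nearest bases of residues that have at least two neighbors are assumptions made for illustration.

```python
import math

def center_of_geometry(atoms):
    """Unweighted mean position of a list of (x, y, z) coordinates."""
    n = len(atoms)
    return tuple(sum(a[k] for a in atoms) / n for k in range(3))

def find_contacts(side_chains, bases, cutoff=8.0):
    """Map each side-chain index to its neighboring base indices, nearest first.

    side_chains and bases are lists of (type_label, atom_coordinates).
    A side chain and a base are neighbors if their centers of geometry
    are less than `cutoff` angstroms apart.
    """
    base_centers = [center_of_geometry(atoms) for _, atoms in bases]
    neighbors = {}
    for i, (_, atoms) in enumerate(side_chains):
        center = center_of_geometry(atoms)
        hits = []
        for j, base_center in enumerate(base_centers):
            d = math.dist(center, base_center)
            if d < cutoff:
                hits.append((d, j))
        hits.sort()  # nearest base first
        neighbors[i] = [j for _, j in hits]
    return neighbors

def contact_sets(neighbors):
    """Build the '1+', '2+' and '2' contact lists as (side_chain, base) pairs."""
    one_plus = [(i, j) for i, js in neighbors.items() for j in js]
    two_plus = [(i, j) for i, js in neighbors.items() if len(js) >= 2 for j in js]
    # Assumed reading of set '2': keep only the two closest bases per residue.
    two = [(i, j) for i, js in neighbors.items() if len(js) >= 2 for j in js[:2]]
    return {"1+": one_plus, "2+": two_plus, "2": two}
```

Sorting the hits by distance keeps the sketch short, because set '2' then reduces to a slice of the neighbor list.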

Calculations of amino-acid interaction preferences 

Amino acid/nucleobase preferences $\varepsilon_{ij}$ (with $i = 1, \ldots, 20$
for amino acids and $j = 1, \ldots, 4$ for bases) were estimated
using the following standard distance-independent contact
potential formalism with the quasi-chemical definition of
the reference state (27-31):

$$\varepsilon_{ij} = -\ln\frac{N_{ij}^{obs}}{N_{ij}^{exp}} = -\ln\frac{N_{ij}^{obs}}{X_i X_j N^{TOT}} \qquad (1)$$

where $N_{ij}^{obs}$ is the number of observed contacts between
amino acid side chain of type i and nucleobase of type j
in experimental structures, and $N_{ij}^{exp}$ is the expected
number of such contacts. The latter is calculated as the
product of molar fractions of amino acid i and base j
among all observed contacts ($X_i$ and $X_j$, respectively)
and the total number of all observed contacts $N^{TOT}$.

Interaction preference scales of amino acids were 
obtained separately for guanine ('G-preference'), adenine 
('A-preference'), cytosine ('C-preference'), uracil ('U-pref- 
erence'), PUR (both G and A, 'PUR-preference') and 
PYR (both C and U, 'PYR-preference').
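
As a worked illustration of Equation (1), the following Python sketch computes preferences from a table of observed contact counts. The function names are hypothetical, and two choices are assumptions rather than details given in the text: pairs with zero observed contacts are returned as None, and the PUR and PYR scales are obtained by pooling the counts of the respective bases before applying the formula.

```python
import math

def interaction_preferences(n_obs):
    """Return eps[i][j] = -ln(N_obs[i][j] / (X_i * X_j * N_tot)).

    n_obs[i][j] is the observed number of contacts between amino-acid
    type i and base type j. Pairs with zero observed contacts are
    returned as None because the logarithm is undefined for them.
    """
    n_tot = sum(sum(row) for row in n_obs)
    x_aa = [sum(row) / n_tot for row in n_obs]          # X_i
    x_base = [sum(col) / n_tot for col in zip(*n_obs)]  # X_j
    eps = []
    for i, row in enumerate(n_obs):
        eps_row = []
        for j, n in enumerate(row):
            expected = x_aa[i] * x_base[j] * n_tot
            eps_row.append(-math.log(n / expected) if n > 0 else None)
        eps.append(eps_row)
    return eps

def pool_bases(n_obs, groups):
    """Merge base columns, e.g. groups=[(0, 1), (2, 3)] for PUR and PYR
    when the columns are ordered G, A, C, U (an assumed ordering)."""
    return [[sum(row[j] for j in g) for g in groups] for row in n_obs]
```

With this sign convention, a pair observed more often than expected receives a negative value, which is why negative correlation coefficients indicate matching later in the text.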

Proteome data 

The sequences of the complete human proteome (17 083 
proteins) and coding sequences of their corresponding 
mRNAs were extracted from UniProtKB database 
(January 2013 release), with maximal-protein-evidence- 
level set at 4 (i.e. proteins annotated as 'uncertain' were 
excluded) and with only the reviewed Swiss-Prot (32) 
entries used for further analysis. The coding sequences
of their corresponding mRNAs were extracted using the
'Cross-references' section of each UniProtKB entry,
where out of several possible translated RNA sequences
the first one satisfying the length criterion (RNA
length = 3 × protein length + 3) was selected and its
sequence downloaded from the European Nucleotide
Archive database (http://www.ebi.ac.uk/ena). Only
protein and RNA sequences consisting exclusively of canonical
amino acids or nucleotides were chosen for analysis. The
complete set of mRNA/protein sequences used herein is
included in the Supplementary Data. The average content
of individual nucleobases, PYR or PUR in the codons
of each of the 20 amino acids ('codon content'
scales) was extracted from the thus-obtained cognate
mRNA and protein sequences.
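
A 'codon content' scale of the kind described above could be computed roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, sequences are plain strings in the RNA alphabet, and a coding sequence is accepted with or without its trailing stop codon (the text itself requires the stop codon to be present).

```python
from collections import defaultdict

def codon_content_scale(pairs, bases="GA"):
    """Average fraction of `bases` in the codons used for each amino acid.

    pairs is an iterable of (protein, mrna) strings in which the coding
    sequence holds one codon per residue, optionally followed by a stop
    codon. bases="GA" gives a PUR scale, "CU" a PYR scale, "G" a G scale.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for protein, mrna in pairs:
        if len(mrna) not in (3 * len(protein), 3 * len(protein) + 3):
            raise ValueError("coding sequence length does not match protein")
        for k, aa in enumerate(protein):
            codon = mrna[3 * k: 3 * k + 3]
            totals[aa] += sum(codon.count(b) for b in bases) / 3.0
            counts[aa] += 1
    return {aa: totals[aa] / counts[aa] for aa in counts}
```

Because every occurrence of an amino acid contributes one codon, the resulting averages automatically reflect codon usage in the sequence set supplied.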

Correlation calculations 

Pearson correlation coefficients (R) were calculated
between nucleobase preferences and 'codon content'
scales and between sequence profiles of nucleobase
content for mRNAs and of different amino-acid preference
scales for proteins from the complete human
proteome set. Before comparison, the profiles were
smoothed using a sliding-window averaging procedure;
a window size of 21 residues/codons was used for all
calculations.
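
The profile comparison can be sketched in Python as below. The names are hypothetical, and the handling of sequence termini (no padding, so each smoothed profile is shorter than the sequence by window - 1 points) is an assumption, as the text does not specify it.

```python
import math

def smooth(values, window=21):
    """Sliding-window average; returns len(values) - window + 1 points."""
    if len(values) < window:
        raise ValueError("sequence shorter than window")
    running = sum(values[:window])
    out = [running / window]
    for k in range(window, len(values)):
        running += values[k] - values[k - window]
        out.append(running / window)
    return out

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def profile_correlation(protein, mrna, scale, bases="GA", window=21):
    """Correlate a protein preference profile with an mRNA base-density profile.

    scale maps each amino acid to its interaction preference; the mRNA
    profile is the per-codon fraction of `bases`.
    """
    pref = [scale[aa] for aa in protein]
    dens = [sum(mrna[3 * k: 3 * k + 3].count(b) for b in bases) / 3.0
            for k in range(len(protein))]
    return pearson(smooth(pref, window), smooth(dens, window))
```

A strongly negative return value corresponds to matching, since favorable preferences are negative while high base density is positive.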

Analysis of statistical significance 

Statistical significance (P-values) of the observed correlations
was estimated using a randomization procedure
involving random shuffling of the interaction preference
scales. Each scale was shuffled one million times, and
Pearson correlation coefficients (R) against codon
content scales as well as for mRNA/protein profiles were
calculated for each shuffled scale. The reported P-values
correspond to the fraction of shuffled scales that exhibit
a higher absolute R than the original (|R| > |R_original|) in
the case of codon content comparisons, or for which <R>
is higher in absolute value than <R_original> in the case of
sequence-profile comparisons.
The typical randomized scales whose distributions of
correlation coefficients are depicted in the manuscript
were chosen to be those whose mean and standard deviation
are the same as the average mean and the average
standard deviation over all 10^6 randomized scales in each
case.
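
For the codon content comparison, the randomization test might look like the following Python sketch. The function names are hypothetical, the default of 10 000 shuffles is chosen only to keep the example fast (the text uses one million), and the fixed seed is there for reproducibility.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def shuffle_p_value(scale, codon_content, n_shuffles=10000, seed=0):
    """Fraction of shuffled scales whose |R| exceeds that of the original.

    scale and codon_content are dicts keyed by amino acid. Shuffling
    reassigns the preference values among the amino acids at random.
    """
    rng = random.Random(seed)
    amino_acids = sorted(scale)
    x = [scale[aa] for aa in amino_acids]
    y = [codon_content[aa] for aa in amino_acids]
    r_original = abs(pearson(x, y))
    shuffled = list(x)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson(shuffled, y)) > r_original:
            hits += 1
    return hits / n_shuffles
```

If no shuffled scale exceeds the original, the function returns 0.0, which in practice would be reported as an upper bound of one over the number of shuffles.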

Analysis of protein disorder and gene ontology (GO) 
classification 

The average disorder for each protein sequence in the
human proteome was predicted using the IUPred server
(33). Fourteen subsets of proteins displaying best or
worst matching between their interaction preference
profiles and nucleobase density profiles of their cognate
mRNAs in terms of Pearson R were extracted from the
human proteome (top and bottom 10% cohorts) for the
six cases of direct correspondence between nucleobase
preferences and nucleobase composition profiles (e.g.
protein G-preference versus G mRNA content, Gprotein-GmRNA,
etc.) and for the G-preference versus PUR
mRNA content one (Gprotein-PURmRNA). Each of these
subsets contains 1707 proteins, for which average
disorder values were assigned. Means and standard deviations
of the 14 thus-obtained distributions of average
predicted disorder were compared with those of the
entire human proteome (background). The significance
of the mean difference from the background was estimated
for each of the analyzed subsets using the Wilcoxon
signed-rank test. The gene ontology (GO) analysis was
performed for the seven top 10% best-matching
protein subsets using the DAVID functional annotation
server (34). The entire human proteome was used as background,
and only the most significantly enriched functional
terms with a DAVID EASE score (P-value)
<10^-10 were considered.
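
The cohort selection can be sketched as follows. This is an illustration only: the function names are hypothetical, and ranking the most negative Pearson R as the best matching is an assumption that follows from the sign convention of the preference scales. The significance test itself is not reproduced here.

```python
def matching_cohorts(r_by_protein, fraction=0.10):
    """Split proteins into best- and worst-matching cohorts by Pearson R.

    r_by_protein maps a protein identifier to its profile correlation.
    The most negative R values are treated as the best matching.
    Returns (best, worst) lists of identifiers.
    """
    ranked = sorted(r_by_protein, key=r_by_protein.get)
    size = int(len(ranked) * fraction)
    if size == 0:
        return [], []
    return ranked[:size], ranked[-size:]

def mean_shift(cohort, disorder):
    """Mean predicted disorder of a cohort minus the background mean.

    disorder maps every protein identifier in the background set to its
    average predicted disorder.
    """
    background = sum(disorder.values()) / len(disorder)
    cohort_mean = sum(disorder[p] for p in cohort) / len(cohort)
    return cohort_mean - background
```

A positive shift for a cohort would indicate enrichment in disordered proteins relative to the background set.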

Data visualization 

The 3D structures of protein-RNA and amino acid/ 
nucleobase complexes were visualized using PyMol 
(http://www.pymol.org/) (35). Contact statistics heat- 
map was produced using MATLAB (R2009a). Pearson 
R distributions for mRNA/protein profiles were processed 
and visualized using Grace (http://plasma-gate.weizmann. 
ac.il/Grace/). 

RESULTS 

Derivation of amino acid/nucleobase interaction 
preferences 

How differentiated and context-dependent are the prefer- 
ences of amino acids to interact with specific nitrogenous 
bases? To address this question, we analyze contact interfaces
of ~300 high-resolution structures of different
protein-RNA complexes including five ribosomal structures
(Supplementary Table S1). We use distances
between centers of geometry of amino-acid side chains
and nucleotide nitrogenous bases in combination with a
fixed cutoff to define contacting neighbors (Figure 1A). In
this way, we isolate sequence-specific protein-RNA
contacts (36-38) while ignoring non-specific interactions
defined exclusively by protein or RNA backbones. We
first present results for the distance cutoff of 8 Å following
Shakhnovich et al., who established cutoffs between 7 and
8 Å to be optimal for residue-based statistical potentials
describing protein-DNA interactions, albeit with a
slightly different definition of reference points (28).
However, all of our principal findings hold qualitatively
for cutoffs between ~6 and 9 Å as discussed later in the
text. Finally, to differentiate cases in which an amino acid
interacts with a single base only from denser, potentially
more stereospecific contacts with more than one neighboring
base within the cutoff, we separately merge contact
statistics over the whole set of studied structures for
amino acids having at least one neighboring base
(set '1+') or at least two neighboring bases (set '2+',
Figure 1A) within the cutoff.

Using standard distance-independent contact potential
formalism (27-31), we subsequently derive scales of amino
acid/nucleobase interaction preferences (Figure 1B and
Supplementary Table S2) and use them to address the
following questions: (i) how does the average composition
of mRNA codons coding for a given amino acid relate to
the preferences of this amino acid to interact with different
nucleobases at protein/RNA interfaces and (ii) how does
sequence density of different bases in mRNA-coding sequences
relate to sequence profiles of amino-acid interaction
preferences for these and other bases in cognate
protein sequences?

Figure 1. Derivation of amino acid/nucleobase interaction preference
scales from known structures of RNA/protein complexes. (A) We
define amino-acid side chains and RNA bases in a given complex to
be contacting neighbors if their centers of geometry are less than a
given cutoff radius R apart (left and middle) and merge contact statistics
over the entire set of studied structures (right, '2+' set with applied
8 Å cutoff). (B) Interaction preference scales of amino acids (in arbitrary
units) for binding to guanines (G), PYR and PUR obtained from
set '2+' statistics using 8 Å cutoff (panel A, right). The scales are statistical
analogs of relative free energy of binding (see 'Materials and
Methods' section) with the prominently negative values corresponding
to amino acid side chains having the highest affinities for bases of a
given type and vice versa.

Amino acid interaction preferences and their codon 
content 

We first focus on contact statistics from set '2+'. 
Dinucleotides were found previously to exhibit potential 
for specific recognition of amino acids at protein-RNA 
interfaces (39) and have also been suggested as potential 
catalysts for amino acid synthesis in pre-biotic environ- 
ments (40). Moreover, set '2+' by definition also 
includes all instances where triplets of bases directly 
contact a given amino acid, which may be relevant in 
the context of the genetic code. Using set '2+' statistics, 
we observe a remarkably strong correlation between preferences
of amino acids to interact with guanine (G-preference,
Figure 1B) and the average PUR content of their
respective codons as derived from the complete human
proteome with Pearson correlation coefficient R of
-0.84 (Figure 2A). Negative Pearson correlation coefficients
indicate matching between amino acid preferences
and codon content owing to the way preference is defined 
(see 'Materials and Methods' section). Put differently, 
amino acids, which are predominantly encoded by 
PURs, display a strong tendency to co-localize with G 
at protein-RNA interfaces. This is also true, albeit at a 
somewhat weaker level of correlation, for matching 
between PUR composition of individual codons from 
the standard genetic table and the respective G-preferences
if the statistics of codon usage in the human
proteome is not included (R = -0.68, Supplementary
Figure S1). The observed signal for G is statistically
highly significant as evidenced by randomization calculations
(P-value < 10"'', Figure 2B). Related to this, G-preference
of amino acids inversely correlates with C and U
content of their codons (Figure 2B). Somewhat less prominent,
but still extremely significant correlations are
observed for G- and C-preference of amino acids and
the average G- and C-content of their codons (R of
-0.47 and -0.58, respectively). On the other hand, the
interface statistics for adenine (A) and uracil (U) do not 
correlate with their average usage in codons. In particular, 
the A-preference of amino acids correlates inversely with 
the A-content (R = 0.59) or directly with the U-content of 
their codons (R = -0.51), whereas the U-preference
exhibits relatively low correlations throughout 
(Figure 2B). Finally, both PYR and PUR binding preferences
of amino acids (Figure 1B) display significant correlations
with PYR and PUR fraction in their codons with
R of -0.54 and -0.53, respectively, and P-values < 10"'' in
both cases. In other words, amino acids coded for by
PYR-rich codons prefer to co-localize with PYR, and 
those coded for by PUR-rich codons with PUR at 
RNA-protein interfaces. Although similar in the present 
case, PYR- and PUR-preference scales need not necessar- 
ily be inverses of each other owing to the way preferences 
are defined, and we therefore here report and discuss both. 



Matching between sequence profiles of mRNAs and their 
cognate proteins 

How do these observations translate if one compares 
complete mRNA-coding sequences with their cognate 
protein sequences? Owing to codon usage bias and non- 
uniform amino-acid composition of the human proteome, 
these results could in principle deviate significantly from 
the results obtained for individual codons and amino 
acids. To address this question, we calculate a Pearson 
R for every cognate mRNA/protein pair in the human 
proteome capturing the correlation between each mRNA 
sequence composition profile with the base-binding pref- 
erence profile of its cognate protein sequence. 
Remarkably, we observe an extremely high level of 
matching between PUR density profiles of mRNAs and 
G-preference profiles of cognate protein sequences with a 
median Pearson R (R_median) over the entire human
proteome of -0.80 and a low P-value (<10"^) as
determined by randomization (Figure 2C). In particular, 
the distribution of Pearson R values for this scale over the 
human proteome is significantly left shifted and shows 
only marginal overlap with the one calculated 
for a typical randomized interaction preference scale 
(Figure 2C). For illustration, we present sequence 
profiles for proteins of the most abundant length (300-400
amino acids, Supplementary Figure S2) displaying
typical (i.e. exhibiting a Pearson R equal to the population 
median) or best levels of correlation (Figure 2D). As is 
evident, the PUR density of mRNAs is quantitatively extremely
well predicted by the G-binding preference profiles
of cognate proteins even for typical human proteins
(R_median = -0.80 and P < 10"^). We also observe significant
matching between C-preference profiles for protein
sequences and both C- and PYR-density profiles of their 
cognate mRNAs with R_median of -0.55 and -0.47, respectively
(Figure 2E). In contrast, the A-preferences
display significant matching with PYR-density profiles 
on the side of mRNA (Figure 2E; see also 
Supplementary Table S2 for the full report of profile correlations)
with R_median of -0.53. Finally, a strong and significant
level of matching is observed for PYR-binding
preferences of amino acids and PYR mRNA profiles as 
well as PUR-binding preferences of amino acids and PUR 
profiles (R_median of -0.58 in both cases and P-values of
8.6 × 10"^ and 7.9 × 10"^^, respectively, Figure 3A and C).
From the exemplary typical and best profiles (Figure 3B 
and D), it is clear that the PYR- and PUR-rich regions in 
mRNA code for stretches of amino acids in cognate 
proteins, which prefer to co-localize with PYR and PUR 
bases, respectively, at protein-RNA interfaces in the 
known 3D PDB structures. The typical level of similarity 
between sequence profiles is actually greater than what 
one might infer from R_median values, suggesting that
Pearson correlation coefficient might not even be the 
optimal measure of deviation in this case. Importantly, 
this direct physico-chemical complementarity between 
mRNA and cognate protein sequences may be indicative 
of pronounced potential for complex formation between 
them, especially under circumstances when 'peak' regions 
become available for such interactions. Given the fact that
a significant matching of profiles is detected at the level of
primary sequences, we propose that the presence of extended
unstructured protein and mRNA segments may
be required for such binding. This suggestion agrees well
with recent knowledge-based studies where RNA loops
and bulges were found to be more likely to interact with
amino-acid side chains in a specific manner (38,41).

Figure 2. Relationship between nucleobase-binding preferences of amino acids and mRNA content at multiple levels. (A) Correlation between G
interaction preferences of amino acids (Figure 1B) and the average PUR content of their codons in mRNAs of the entire human proteome.
(B) Pairwise Pearson correlation coefficients (R) between base-binding preference scales of amino acids ('scl') and average base content of their

How sensitive is the level of matching to the choice of
cutoff distance used to define contacting amino acids and
nucleobases in protein/RNA complexes? To address this
question, we have repeated the aforementioned analysis
for a range of different cutoff values going from 6 to
10 Å in steps of 0.25 Å (Figure 4). Overall, for set '2+',
our findings are largely robust to the choice of the exact
cutoff in this range, albeit with a somewhat lower level of
significance for longer cutoffs. However, the majority of
the signal is lost if one uses the '1+' set, except for
G-preference and PUR-content (Figure 5A) and
A-preference and PYR-content (Supplementary Table
S2). This observation strongly suggests that close dense
packing of nucleobases around amino acids may be
required for specificity in cognate complex formation.
Although interfaces may be dynamic and liquid-like, as
we have suggested before, they may still need to be
densely packed. Interestingly, if one reduces the '2+' set
by including only the two closest bases in contact with a
given amino acid (set '2'), the signal for G-preference/
PUR-content even further improves by several percentage
points (Figure 5A), and the same holds for C-preference/
C-content and A-preference/PYR-content (Supplementary
Table S2).

Figure 3. PYR/PUR mRNA sequence profiles strongly match PYR/PUR-preferences
of cognate protein sequences. PYR (A and B) and PUR (C
and D) amino-acid preference scales are given in Figure 1B. For details,
please see the analogous captions to Figure 2C and D. [Panel titles:
Serine dehydratase-like (Q96GA7), R = -0.58; Protein ATP1B4 (Q9UN42),
R = -0.94; Protein FAM50A (Q14320), R = -0.93. Horizontal axis:
mRNA codon / protein residue.]

To further study the role of protein structural disorder 
in matching, we have analyzed the levels of the predicted 
disorder of the top and the bottom 10% of proteins when 
it comes to the degree of mRNA/protein profile matching 
as captured by Pearson R coefficient (see 'Materials and 
Methods' section). We have done this for the six cases of 
direct comparison whereby the same base type is used for 
both protein preference and mRNA profile density 
for the top and the bottom 10% cohorts to be significantly
enriched (top 10%) and depleted (bottom 10%) in disordered
proteins (Supplementary Table S3), whereas in
the case of Uprotein-UmRNA matching, the situation is
reversed. Interestingly, for PURprotein-PURmRNA,
PYRprotein-PYRmRNA and Gprotein-PURmRNA matching,
one observes slight disorder enrichment in both top and
bottom cohorts. The most prominent shift of the distribu- 
tion of predicted average disorder toward higher disorder 
as compared with background is observed for the top 
10% cohort of proteins displaying strong matching 
between C-preference profiles of their sequences and the 
C-content of their cognate mRNAs (Cprotein-CmRNA, 
Supplementary Table S3, Supplementary Figure S3). 
One might argue that this effect could just be related to 
compositional properties of such protein and mRNA 
pairs, whereby disordered proteins are simply encoded 
by C-rich sequences. However, the differences between 
nucleobase compositions of mRNAs from the Cprotein- 
CmRNA top 10% cohort and the complete proteome are 
minor, suggesting that the underlying explanation might 
be more complex (Supplementary Figure S3). 

Figure 4. Effect of cutoff radius used to define protein-RNA contacts
on observed correlations. (A) Dependence of Pearson correlation coefficients
(R) between amino acid preference scales and average codon
content on the cutoff radius for the two sets of statistics studied ('1+',
'2+'). The total number of unique contacts in the '1+' and '2+' sets (given in
parentheses) obtained for each of the cutoff radii used is indicated at
the top of the panel. (B) Cutoff radius dependence of median pairwise
Pearson correlation coefficients (R_median) for comparison between
nucleobase content profiles of mRNAs and base-preference-weighted
protein sequence profiles over the entire human proteome (color code
the same as in panel A).

Figure 5. Physico-chemical origins of the mRNA/protein relationship.
(A) Correlation coefficients (R and <R> with standard deviations)
between PYR or PUR average codon content ('Codon content') and
respective mRNA profiles ('Profiles') calculated for G- (blue), PUR-
(red) and PYR- (green) binding preferences of amino acids, which
were obtained using different amino acid neighbor statistics (1+, 2+
or 2). (B) A model of physico-chemical complementarity between
proteins and cognate mRNAs. Preferential interactions of amino
acids with PYR or PUR define their codon content in the genetic
table and facilitate complementary interactions between PYR/PUR-rich
mRNA regions and PYR/PUR preferring regions in proteins.
The opposite behavior of adenines and guanines adds an additional
layer of complexity in the case of PURs as signified by dashed
arrows in the model. Note: polymer sizes not drawn to scale.

Which biological functions might be associated with a
high level of complementarity between proteins and
cognate mRNAs? To address this question, we have performed
GO analysis for seven different top 10% subsets of
proteins displaying strong matching with cognate mRNAs
(see 'Materials and Methods' section for details). In
Supplementary Table S4, we report the most significantly 
enriched biological functions (using a P-value cutoff of
10^-10) shared by proteins from the analyzed cohorts. In
a striking agreement with our hypothesis, in most cases, 
we observe pronounced enrichment of terms related to 
nucleic-acid/protein interactions, including regulation of 
RNA metabolic processes, ribonucleoprotein complexes 
and transcription. The latter, in particular, allows one to
speculate that protein tendencies to associate with cognate
mRNA might be used by cells to modulate gene expression
pathways. What is more, PUR or PYR density
profiles of mRNAs are identical to PUR or PYR density 
profiles of coding-strand DNA sequences (with Us being 
replaced by Ts). Although based on our statistical poten- 
tials, we cannot say anything about T-binding preferences 
of amino acids, it is possible that our results may be gen- 
eralizable even to DNA-protein interactions as well as 
other RNA-protein interactions. One should also 
mention that depending on the particular type of 
matching, other biological functions also tend to be 
enriched. For instance, the Uprotein-UmRNA top 10%
subset displays significant enrichment of membrane
proteins, whereas the Gprotein-PURmRNA top cohort seems to
be populated by extracellular proteins and particularly
those involved in the functioning of the innate immune 
system. Altogether, our preliminary GO analysis illus- 
trates significant functional differences between proteins 
that strongly complement their cognate mRNAs and the 
rest of the human proteome, and these findings will be 
further explored in another manuscript. 



DISCUSSION 

High levels of matching between base-binding-preference 
profiles of proteins and PYR- or PUR-density profiles of 
cognate mRNA-coding sequences, defined primarily by 
amino acid preferences to co-localize with G and C 
bases at RNA/protein interfaces, allow one to speculate 
that direct complementary binding interactions may be a 
key element underlying the whole mRNA/protein rela- 
tionship when it comes to both its evolutionary develop- 
ment as well as present day biology (Figure 5B). This 
agrees well with and significantly extends our previous 
findings where we have shown that protein sequence 
profiles of amino acid affinity for PYR analogs (42-44)
mirror PYR density profiles of cognate mRNA sequences
(15). It should be emphasized, however, that our present
results are based exclusively on the statistics of direct 
amino acid/nucleobase contacts at RNA/protein inter- 
faces. It is therefore still possible that the driving force 
for interactions between mRNAs and cognate proteins is 
non-specific (e.g. binding of positively charged amino acid 
side chains to RNA phosphate groups), whereas comple- 
mentary interactions actually confer specificity to binding. 

Moreover, our results provide a clear evolutionary per- 
spective concerning the physico-chemical origins of trans- 
lation in line with the stereo-chemical hypothesis of the 
origin of the genetic code (16-21). In particular, our 
results give strong support to the possibility of direct
templating of proteins from mRNAs in the era before
the development of ribosomal decoding and the code's
fixation in that era (17,45). In this framework, ancient 
amino acids associated with mRNA directly following 
their intrinsic physico-chemical preferences as outlined 
here. However, the fact that an analogous effect is not 
seen for all bases, especially adenine and uracil, supports 
the possibility that in addition to physico-chemical ration- 
ales in the context of direct binding other evolutionary 
forces were also responsible for shaping the genetic code 
as suggested before (19). Our results are most consistent 
with the possibility that the early stereo-chemical phase in
the code's development was dominated by G- and C-rich
codons, as strongest correlations are seen for precisely 
these bases. If the basic structure of the early genetic 
code was defined by such codons, but was later modulated 
by the inclusion of A and U bases, this might explain why 
G-affinity of amino acids in present-day protein sequences 
closely follows PUR density profiles in cognate mRNAs. 
Interestingly, Trifonov and coworkers have suggested that 
the first codons were G- and C-rich on the basis of a con- 
sensus analysis of 40 different criteria (46). 
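
The profile comparison referred to above can be illustrated with a
minimal sketch. It assumes that a per-codon purine fraction and a
per-residue affinity value are each smoothed with a sliding window and
then compared by Pearson correlation. The function names, the window
size and the affinity scale passed in as `scale` are hypothetical
placeholders for illustration, not the scales or parameters derived in
this work.

```python
from statistics import mean

PURINES = {"A", "G"}


def sliding_mean(values, window):
    """Average `values` over every full window of length `window`."""
    return [mean(values[i:i + window])
            for i in range(len(values) - window + 1)]


def purine_density_profile(mrna, window):
    """Purine fraction per codon, smoothed over `window` codons."""
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    per_codon = [sum(base in PURINES for base in codon) / 3
                 for codon in codons]
    return sliding_mean(per_codon, window)


def affinity_profile(protein, scale, window):
    """Per-residue affinity from `scale`, smoothed over `window` residues.

    `scale` maps one-letter amino acid codes to numbers; it is supplied
    by the caller (assumption: no real scale is bundled here).
    """
    return sliding_mean([scale[aa] for aa in protein], window)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5
```

With one codon per residue, the two smoothed profiles have equal
length and can be passed directly to `pearson`. The sign of the
resulting coefficient depends on the convention of the affinity scale
used.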

Importantly, it should be emphasized that the stereo- 
chemical hypothesis of the code's origin may differ from 
the cognate mRNA/protein complementary interaction 
hypothesis in terms of its evolutionary underpinnings. 
Direct templating of proteins from mRNAs in ancient 
systems (the coding aspect of the stereo-chemical hypoth- 
esis) does not necessarily imply that modern proteins 
directly interact with their own mRNA (complementary 
interaction hypothesis). However, our findings support the 
possibility that the origin of the genetic code and potential 
complementarity between proteins and cognate mRNAs 
might have the same physico-chemical background. It is 
quite possible that other independent influences have 
shaped both effects, and the two hypotheses leave ample 
room for such refinements. However, we would like to 
stress that in our view, the two hypotheses are inter- 
linked: cognate binding is on the one hand a reasonable 
consequence of the stereo-chemical hypothesis, but on the 
other hand, it also gives a potential biological rationale for 
the early development of the code to begin with, such as 
stabilization of RNA structures by bound polypeptides, as 
has been suggested before (45). 

There are a number of open challenges concerning the 
aforementioned proposal. First and foremost, the struc- 
tural features of mRNAs and cognate proteins impose 
severe constraints on any putative complementarity 
between the two. Namely, with the contour length of the 
mRNA coding part being ~4.5 times longer than that of a 
cognate protein, it is not clear what structural arrange- 
ments may be consistent with any complementary inter- 
actions. We would like to suggest that structures of such 
complexes may be dynamic and liquid-like with mRNA 
stretches enveloping and solubilizing cognate protein 
stretches (15). Second, with many mRNAs and proteins 
being well-folded and compact for most of the time, it 
remains to be studied when and how opportunities could 
arise for the complementarity between their primary se- 
quences to be of relevance. It is possible that, if at all 
realistic, such complementary binding might be functionally
important precisely in those situations where both 
polymers are unstructured such as during translation, 
export and degradation, as a consequence of thermal 
stress or in the case of intrinsically unstructured 
proteins. However, we do not exclude the possibility of 
complementary interactions even in the folded state. 
Finally, concerning the origin of the genetic code, it is 
not clear how the final well-defined structure of the code 
could have arisen based on still partially non-specific 
large-scale binding interactions between mRNAs and 
cognate proteins. As suggested before, it is possible that 
the answer lies in a combination of different influences 
(19). Future research should shed light on these and 
related questions. 

These challenges notwithstanding, our findings provide 
strong evidence that the ability to interact with mRNA 
might be a widespread phenomenon in the cell involving 
not only cognate proteins but also other proteins based on 
similar principles. The significance of such 
physico-chemical complementarity between mRNAs and 
proteins potentially extends to all facets of nucleic acid 
and protein biology in the modern cell including transcrip- 
tion/translation regulation (9,10,47,48), mRNA transport 
and localization (49,50), processing and decay (51), struc- 
ture of ribonucleoproteins (52) and others (2-5,53,54). 
Our preliminary GO analysis has demonstrated a signifi- 
cant enrichment of functions related to association with 
nucleic acids for the subsets of proteins that complement 
their cognate mRNAs strongly, and these findings will be 
explored in more detail in future work. 